author: Daniel Chen, Sara Altman id: analyze_data_with_python_using_posit_workbench_and_snowflake summary: Analyze Data with Python using Posit Workbench and Snowflake categories: Getting-Started environments: web status: Published feedback link: https://github.com/Snowflake-Labs/sfguides/issues tags: Getting Started, Data Science, Python, Posit Workbench, Native Applications

Analyze Data with Python using Posit Workbench and Snowflake

Overview

Duration: 1

In this guide, we’ll use Python to analyze data in Snowflake using the Posit Workbench Native App. You’ll learn how to launch the Posit Workbench Native App and use the available VS Code IDE. You’ll also learn how to use the Ibis library to translate Python code into SQL, allowing you to run data operations directly in Snowflake’s high-performance computing environment.

We’ll focus on a healthcare example by analyzing heart failure data. We’ll guide you through accessing the data and performing data cleaning, transformation, and visualization. Finally, you’ll see how to generate an HTML report, build an interactive Shiny app, and write data back to Snowflake, completing an end-to-end analysis in Python entirely within Snowflake.

What You’ll Need

  • Familiarity with Python
  • The ability to launch Posit Workbench from Snowflake Native Applications. This can be provided by an administrator with the accountadmin role.

What You’ll Learn

  • How to work in VS Code from the Posit Workbench Native App.
  • How to connect to your Snowflake data from Python to create tables, visualizations, and more.

What You’ll Build

  • A VS Code environment to use within Snowflake.
  • A Quarto document that contains plots and tables built with Python, using data stored in Snowflake.
  • An interactive Shiny for Python application built using data stored in Snowflake.

Along the way, you will use Python to analyze which variables are associated with survival among patients with heart failure. You can follow along with this quickstart guide, or look at the materials provided in the accompanying repository: https://github.com/posit-dev/snowflake-posit-quickstart-python.

Setup

Duration: 15

Before we begin, there are a few components we need to prepare. We need to:

  • Add the heart failure data to Snowflake
  • Launch the Posit Workbench Native App
  • Create a VS Code session
  • Create a virtual environment and install the necessary libraries

Add the heart failure data to Snowflake

For this analysis, we’ll use the Heart Failure Clinical Records dataset. The data is available for download as a CSV from the UCI Machine Learning Repository.

We’ll walk through how to download the data from UCI and then upload it to Snowflake from a CSV.

aside positive

If you have the necessary permissions in Snowflake, you can also import the data from this S3 bucket: s3://heart-failure-records/heart_failure.csv.

Step 1: Download the data as a CSV

Download the data from UCI here, and then unzip the downloaded file.

Step 2: Add data in Snowsight

Log into Snowsight, then click Create > Add Data. You can find the Create button in the upper-left corner.

Step 3: Load data

Choose the Load Data into a Table option, then select your downloaded heart failure CSV. Specify an existing database or create a new one for the heart failure data (we called ours HEART_FAILURE). Then, select + Create a new table and name it HEART_FAILURE.

Step 4: Confirm data

You should now be able to see the heart failure data in Snowsight. Navigate to Data > Databases, then select the database to which you added the data (e.g., HEART_FAILURE). Expand the database, schema, and tables until you see the HEART_FAILURE table.

Launch Posit Workbench

We can now start exploring the data using Posit Workbench. You can find Posit Workbench as a Snowflake Native Application and use it to connect to your database.

Step 1: Navigate to Apps

In your Snowflake account, go to Data Products > Apps to open the Native Apps collection. If Posit Workbench is not already installed, click Get. Please note that the Native App must be installed and configured by an administrator.

Step 2: Open the Posit Workbench Native App

Once Posit Workbench is installed, click the app under Installed Apps to launch it. If you do not see the Posit Workbench app listed, ask your Snowflake account administrator for access to the app.

After clicking on the app, you will see a page with configuration instructions and a blue Launch app button.

Click on Launch app. This should take you to the webpage generated for the Workbench application. You may be prompted to first log in to Snowflake using your regular credentials or authentication method.

Create a VS Code Session

Posit Workbench provides several IDEs, including VS Code, RStudio Pro, and JupyterLab. For this analysis we will use VS Code.

Step 1: New Session

Within Posit Workbench, click New Session to launch a new session.

Step 2: Select an IDE

When prompted, select VS Code.

Step 3: Log into your Snowflake account

Next, connect to your Snowflake account from within Posit Workbench. Under Session Credentials, click the button with the Snowflake icon to sign in to Snowflake. Follow the sign in prompts.

When you’re successfully signed into Snowflake, the Snowflake button will turn blue and there will be a checkmark in the upper-left corner.

Step 4: Launch VS Code

Click Start Session to launch VS Code.

Once everything is ready, you will be able to work with your Snowflake data in VS Code. Since the IDE is provided by the Posit Workbench Native App, your entire analysis will occur securely within Snowflake.

Step 5: Install Quarto and Shiny Extensions

The Quarto and Shiny VS Code extensions support the development of Quarto documents and Shiny apps in VS Code. The Jupyter extension provides support for running Python code in notebook cells.

Install these extensions:

  1. Open the VS Code Extensions view. On the right-hand side of VS Code, click the Extensions icon in the Activity bar to open the Extensions view.
  2. Search for “Quarto” to find the Quarto extension.
  3. Install the Quarto extension. Click on the Quarto extension, then click Install.
  4. Install the Shiny extension. Search for the Shiny extension, then install it in the same way.
  5. Install the Jupyter extension. Search for the Jupyter extension, then install it in the same way.

You can learn more about these extensions here: Shiny extension, Quarto extension.

Step 6: Access the Quickstart Materials

This Quickstart will step you through the analysis contained in https://github.com/posit-dev/snowflake-posit-quickstart-python/blob/main/quarto.qmd. To follow along, clone the GitHub repo:

  1. Once the session opens, click Clone Git Repository.
  2. Paste the URL (https://github.com/posit-dev/snowflake-posit-quickstart-python/) of the GitHub repo into the Command Palette.
  3. Click Clone from URL.
  4. Choose a folder to clone the repo into.
  5. Follow the prompts to authenticate to GitHub.
  6. When prompted, click Open to open the cloned repository.
  7. Open quarto.qmd.

Create a virtual environment

Create a virtual environment and install dependencies from the provided requirements.txt file:

  1. Open the Command Palette (Ctrl/Cmd+Shift+P), then search for “Python: Create Environment”.
  2. Select Venv to create a .venv virtual environment.
  3. Select a Python version to use for the environment.
  4. When prompted to select dependencies to install, check requirements.txt. If you don’t see this prompt, open a terminal and run pip install -r requirements.txt after the environment is created.
  5. Click OK. VS Code will install the required packages and create a virtual environment in your current working directory (which should be the folder created from the GitHub repo).

See the Python Environments in VS Code section of the Posit Workbench User Guide to learn more about Python environments in Posit Workbench.

Access Snowflake data from Python

Duration: 5

We’ll run our code in our Python environment, but the code will use data stored in our database on Snowflake.

To access this data, we’ll use the Ibis library to connect to the database and query the data from Python, without having to write raw SQL. Let’s take a look at how this works.

Connect with Ibis

Ibis is an open source dataframe library that works with a wide variety of backends, including Snowflake.

First, we import ibis, then use ibis.snowflake.connect to connect to the Snowflake database. We need to provide a warehouse for compute and a database to connect to. We can also provide a schema here to make connecting to specific tables easier.

import ibis 

con = ibis.snowflake.connect(
  warehouse="DEFAULT_WH",  
  database="HEART_FAILURE",  
  schema="PUBLIC",
  connection_name="workbench"
)

aside positive

Your warehouse, database, and schema values may differ, depending on your available warehouses and how you named your database.

The variable con now stores our connection.
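
To confirm the connection works, you can list the tables available in the connected schema with list_tables(), which is part of the standard Ibis backend API:

con.list_tables()
# e.g., ['HEART_FAILURE']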

Create a table that corresponds to a table in the database

Once we build a connection, we can use table() to create an Ibis table expression that represents the database table.

heart_failure = con.table("HEART_FAILURE")
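
At this point nothing has been pulled into memory; heart_failure is just a table expression. As a quick sanity check, you can preview a few rows by chaining head() with execute():

heart_failure.head(5).execute()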

Rely on Ibis to translate Python to SQL

We can now use Ibis to interact with heart_failure. For example, we can filter rows and select columns from our data.

heart_failure_filtered = (
    heart_failure.filter(heart_failure.AGE < 50)  
    .rename(
        {
            "age": "AGE",
            "diabetes": "DIABETES",
            "serum_sodium": "SERUM_SODIUM",
            "serum_creatinine": "SERUM_CREATININE",
            "sex": "SEX",
            "death_event": "DEATH_EVENT",
        }
    )
    .select(
        ["age", "diabetes", "serum_sodium", "serum_creatinine", "sex", "death_event"]
    )
)

Right now, heart_failure_filtered is still a table expression. Ibis lazily evaluates commands, which means that the full query is never run on the database unless explicitly requested.

You can force Ibis to compile the table expression into SQL and run that SQL on Snowflake with .execute() or .to_pandas().

heart_failure_filtered.execute()

If we want to see the SQL code that Ibis generates, we can pass the table expression to ibis.to_sql().

ibis.to_sql(heart_failure_filtered)

This returns the following SQL:
SELECT
  "t0"."AGE" AS "age",
  "t0"."DIABETES" AS "diabetes",
  "t0"."SERUM_SODIUM" AS "serum_sodium",
  "t0"."SERUM_CREATININE" AS "serum_creatinine",
  "t0"."SEX" AS "sex",
  "t0"."DEATH_EVENT" AS "death_event"
FROM "HEART_FAILURE" AS "t0"
WHERE
  "t0"."AGE" < 50

In summary

This system:

  1. Keeps our data in the database, saving memory in the Python session.
  2. Pushes computations to the database, saving compute in the Python session.
  3. Evaluates queries lazily, saving compute in the database.

We don’t need to manage the process; it happens automatically behind the scenes.

You can learn more about Ibis here. Take a look at the Snowflake backend documentation to learn more about using Ibis to interact with Snowflake specifically.

Write to a Snowflake database

Duration: 1

You can also use Ibis to create a new table in a database or append to an existing table.

To add a new table, use create_table().

con.create_table("HEART_FAILURE_FILTERED", heart_failure_filtered)

To insert data into an existing table, use insert().
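
For example, assuming the HEART_FAILURE_FILTERED table created above exists, a call like this appends the rows of our filtered table expression to it (insert() also accepts a pandas DataFrame as its second argument):

con.insert("HEART_FAILURE_FILTERED", heart_failure_filtered)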

Now that we understand how to interact with our database, we can use Python to perform our analysis.

Prepare data with Ibis

Duration: 5

We want to understand which variables in HEART_FAILURE are associated with survival of patients with heart failure.

First, we rename the columns we’ll use to lowercase, so we won’t need to worry about capitalization.

heart_failure = heart_failure.rename(
    {
        "age": "AGE",
        "diabetes": "DIABETES",
        "serum_sodium": "SERUM_SODIUM",
        "serum_creatinine": "SERUM_CREATININE",
        "sex": "SEX",
        "death_event": "DEATH_EVENT",
    }
)

Filter ages

For now, we’ll focus on patients younger than 50. We also reduce the data to just the columns we’re interested in.

heart_failure_filtered = (
    heart_failure
    .filter(heart_failure.age < 50)  # Filter to age < 50
    .select(["age", "diabetes", "serum_sodium", "serum_creatinine", "sex", "death_event"])
)

aside positive

By default, Ibis is in deferred mode and lazily evaluates table expressions. To more easily interact with your tables and see the results of your code, you can turn on interactive mode. In interactive mode, expressions are executed when printed to the console. Turn on interactive mode with ibis.options.interactive = True.
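
For example:

import ibis

ibis.options.interactive = True
heart_failure_filtered  # printing now runs the query and shows a preview of the rows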

Visualize Data with plotnine

Duration: 5

The heart failure data provides important insights that can help us:

  • Identify factors associated with increased risk of mortality after heart failure.
  • Predict future survival outcomes based on historical clinical data.
  • Benchmark patient outcomes based on clinical indicators like serum sodium levels.

Visualizing clinical variables across different patient groups can help identify patterns.

Visualize serum sodium levels

We can use plotnine to visually compare sodium levels across different patient groups. In this plot, we see the distribution of serum sodium based on whether the patients have diabetes and whether they survived (0) or died (1) during the follow-up period.

heart_failure_df = heart_failure_filtered.execute()

# Convert columns to strings for plot
heart_failure_df['death_event'] = heart_failure_df['death_event'].astype(str)
heart_failure_df['diabetes'] = heart_failure_df['diabetes'].astype(str)

from plotnine import ggplot, aes, geom_boxplot, labs, theme

(
    ggplot(heart_failure_df, aes(x='death_event', y='serum_sodium', color='diabetes')) +
    geom_boxplot() +
    labs(
        title="Serum Sodium Levels by Diabetes Status and Survival Outcome",
        x="Survival Outcome (0 = Survived, 1 = Died)",
        y="Serum Sodium (mEq/L)",
        color="Diabetes"
    ) +
    theme(legend_position='bottom')
)

aside positive

We first run .execute() to force the table expression to execute. This makes it easier to change the column types in preparation for plotting. We don’t need to worry about unnecessarily executing the table expression because plotting requires evaluation anyway.

Make publication-ready tables with Great Tables

Duration: 5

Next, we’ll use Ibis to calculate the median values for various clinical metrics across different patient groups.

(
    heart_failure_filtered
    .group_by(["death_event", "diabetes"])
    .aggregate(
        median_age=heart_failure_filtered["age"].median(),
        median_serum_creatinine=heart_failure_filtered["serum_creatinine"].median(),
        median_serum_sodium=heart_failure_filtered["serum_sodium"].median()
    )
)

This is a useful way to examine the information for ourselves. However, if we wish to share the information with others, we might prefer to present the table in a more polished format. We can do this with the Great Tables package.

The following code prepares a table named comparison that we’ll display with Great Tables.

comparison = (
    heart_failure_filtered
    .group_by(["death_event", "diabetes"])
    .aggregate(
        median_age=heart_failure_filtered["age"].median(),
        median_serum_creatinine=heart_failure_filtered["serum_creatinine"].median().cast("float64"),
        median_serum_sodium=heart_failure_filtered["serum_sodium"].median()
    )
    .mutate(
        death_event=ibis.ifelse(ibis._.death_event == 1, "Died", "Survived"),
        diabetes=ibis.ifelse(ibis._.diabetes == 1, "Yes", "No")
    )
    .rename(
        {
            'Survival Outcome': 'death_event',
            'Diabetes Status': 'diabetes',
            'Median Age': 'median_age',
            'Median Serum Creatinine (mg/dL)': 'median_serum_creatinine',
            'Median Serum Sodium (mEq/L)': 'median_serum_sodium'
        }
    )
)

Next, we use GT() and other Great Tables functions to create and style a table that displays comparison. Note that we need to evaluate comparison with .execute() first because GT() only accepts Pandas or Polars DataFrames.

from great_tables import GT

(
    GT(comparison.execute())
    .tab_header(title="Clinical Metrics by Survival Outcome and Diabetes Status")
    .fmt_number(
        columns=["Median Age", "Median Serum Creatinine (mg/dL)", "Median Serum Sodium (mEq/L)"],
        decimals=1
    )
    .data_color(
        columns=["Median Serum Creatinine (mg/dL)", "Median Serum Sodium (mEq/L)"],
        palette=["white", "blue"]
    )
)

Now that we’ve accumulated some insights, let’s think about how we might present the results of our analysis to our colleagues.

Build Reports and Dashboards with Quarto

Duration: 2

We’ve conveniently written our analysis in a Quarto (.qmd) document, quarto.qmd. Quarto is an open-source publishing system that makes it easy to create data products such as documents, presentations, dashboards, websites, and books.

By placing our work in a Quarto document, we have interwoven all of our code, results, output, and prose text into a single literate programming document. This way everything can travel together in a reproducible data product.

A Quarto document can be thought of as a regular markdown document, but with the ability to run code chunks.
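
For example, here is what a minimal .qmd file might look like (a simplified sketch, not the full quarto.qmd from the repo):

---
title: "Heart Failure Analysis"
format: html
---

Narrative text written in markdown.

```{python}
# Code chunks run like notebook cells when the document renders
1 + 1
```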

You can run any of the code chunks by clicking the Run Cell button above the chunk in VS Code.

When you run a cell, cell output is displayed in the Jupyter interactive console.

To render and preview the entire document, click the Preview button or run quarto preview quarto.qmd from the terminal.

This will run all the code in the document from top to bottom and generate an HTML file, by default, for you to view and share.

Learn More about Quarto

You can learn more about Quarto here: https://quarto.org/, and find the documentation for all the various Quarto outputs here: https://quarto.org/docs/guide/. Quarto works with Python, R, and Observable JavaScript code out of the box, and is a great tool for communicating your data science analyses.

Shiny Application

Duration: 2

One way to share our work and allow others to explore the heart failure dataset is to create an interactive Shiny app.

We’ve prepared an example Shiny app in the directory: https://github.com/posit-dev/snowflake-posit-quickstart-python. Our app allows the user to explore different clinical metrics in one place.

To run the app, open app/app.py and then click the Run Shiny App button at the top of the script in VS Code.

Change the metric in the sidebar to control which metric is plotted.
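
If you’re curious about the core structure of such an app, here is a minimal sketch of a Shiny for Python app with the same sidebar-plus-plot layout. It is not the repo’s actual app/app.py; the stand-in DataFrame below takes the place of the data we loaded from Snowflake with Ibis:

from shiny import App, render, ui
import pandas as pd
from plotnine import ggplot, aes, geom_histogram, labs

# Stand-in data; the real app pulls this from Snowflake with Ibis,
# e.g., heart_failure_df = heart_failure_filtered.execute()
heart_failure_df = pd.DataFrame({
    "serum_sodium": [135, 140, 137, 129],
    "serum_creatinine": [1.1, 0.9, 1.8, 2.2],
    "age": [45, 49, 38, 42],
})

app_ui = ui.page_sidebar(
    ui.sidebar(
        ui.input_select(
            "metric", "Clinical metric",
            ["serum_sodium", "serum_creatinine", "age"],
        ),
    ),
    ui.output_plot("dist"),
    title="Heart failure explorer",
)

def server(input, output, session):
    @render.plot
    def dist():
        # Re-renders whenever the selected metric changes
        return (
            ggplot(heart_failure_df, aes(x=input.metric()))
            + geom_histogram(bins=10)
            + labs(x=input.metric(), y="Count")
        )

app = App(app_ui, server)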

Learn More About Shiny

You can learn more about Shiny at: https://shiny.posit.co/.

If you’re new to Shiny, you can try it online with shinylive. Shinylive is also available for R.

Conclusion and Resources

Duration: 2

Python is a powerful, versatile tool for data science, and combined with Snowflake’s high-performance data capabilities, it enables robust, end-to-end data workflows. Using the Posit Workbench Native Application, you can securely work with Python within Snowflake while taking advantage of tools like Ibis, Quarto, and Shiny for Python to analyze, visualize, and share your results.

What You Learned

  • How to use VS Code within the Posit Workbench Native App.
  • How to connect to your Snowflake data from Python to create tables, visualizations, and more.
  • How to create a Quarto document containing plots and tables built in Python, using data stored in Snowflake.
  • How to build an interactive Shiny for Python application, working with data stored in Snowflake.